The search functionality is under construction.

Keyword Search Result

[Keyword] fault tolerance(100hit)

41-60hit(100hit)

  • PREGMA: A New Fault Tolerant Cluster Using COTS Components for Internet Services

    Takeshi MISHIMA  Takeshi AKAIKE  

     
    PAPER-Dependable Systems

      Vol:
    E86-D No:12
      Page(s):
    2517-2526

    We propose a new dependable system called PREGMA (Platform for Reliable Environment based on a General-purpose Machine Architecture). PREGMA aims to meet two requirements -- fault tolerance and low cost -- for Internet services. It can provide fault tolerance, so we can avoid system failure and prevent data corruption, even if faults occur. That is, it masks the faults by running multiple replicated servers, each possessing its own data, in a loosely synchronized manner and delivering the majority vote as output to clients. Moreover, PREGMA is composed of COTS (Commercial Off-The-Shelf) components without modification, which makes it possible to offer the services at a low cost. We investigated two approaches for achieving redundancy of the Coordinator, which is the core of PREGMA: using the primary backup method and the active replication method. We evaluated the effectiveness of PREGMA in terms of throughput overhead, data integrity and recovery time. The results for a prototype show that PREGMA using the Coordinator with the primary backup method outperforms that with the active replication method and has throughput only 3% lower than a non-redundant system. The results also show that, in the event of failure, the recovery time is only less than one second and no data corruption occurs.

  • A Novel Learning Algorithm Which Makes Multilayer Neural Networks Multiple-Weight-Fault Tolerant

    Itsuo TAKANAMI  Yasuhiro OYAMA  

     
    PAPER-Dependable Systems

      Vol:
    E86-D No:12
      Page(s):
    2536-2543

    We propose an efficient algorithm for making multi-layered neural networks (MLN) fault-tolerant to all multiple weight faults in a multi-dimensional interval by injecting intentionally two extreme multi-dimensional values in the interval into the weights of the selected multiple links in a learning phase. The degree of fault-tolerance to a multiple weight fault is measured by the number of essential multiple links. First, we analytically discuss how to choose effectively the multiple links to be injected, and present a learning algorithm for making MLNs fault tolerant to all multiple (i.e., simultaneous) faults in the interval defined by two multi-dimensional extreme points. Then it is proved that after the learning algorithm successfully finishes, MLNs become fault tolerant to all multiple faults in the interval. It is also shown that the time in a weight modification cycle depends little on multiplicity of faults k for small k. These are confirmed by simulation.

  • A Three-tier Active Replication Protocol for Large Scale Distributed Systems

    Carlo MARCHETTI  Sara Tucci PIERGIOVANNI  Roberto BALDONI  

     
    PAPER-Dependable Software

      Vol:
    E86-D No:12
      Page(s):
    2544-2552

    The deployment of server replicas of a service across an asynchronous distributed system (e.g., Internet) is a real practical challenge. This target cannot be indeed achieved by classical software replication techniques (e.g., passive and active replication) as these techniques usually rely on group communication toolkits that require server replicas to run over a partially synchronous distributed system to solve the underlying agreement problem. This paper proposes a three-tier architecture for software replication that encapsulates the need of partial synchrony in a specific software component of a mid-tier to free replicas and clients from the need of underlying partial synchrony assumptions. Then we propose how to specialize the mid-tier in order to manage active replication of server replicas.

  • An Algorithm for Node-to-Set Disjoint Paths Problem in Burnt Pancake Graphs

    Keiichi KANEKO  

     
    PAPER-Dependable Communication

      Vol:
    E86-D No:12
      Page(s):
    2588-2594

    A burnt pancake graph is a variant of Cayley graphs and its topology is suitable for massively parallel systems. However, for a burnt pancake graph, there is much room for further research. Hence, in this study, we focus on n-burnt pancake graphs and propose an algorithm to obtain n disjoint paths from a source node to n destination nodes in polynomial order time of n, n being the degree of the graph. In addition, we estimate the time complexity of the algorithm and the sum of path lengths. We also give a proof of correctness of the algorithm. Moreover, we report the results of computer simulation to evaluate the average performance of the algorithm.

  • Consideration of Fault Tolerance in Autonomic Computing Environment

    Yoshihiro TOHMA  

     
    INVITED PAPER

      Vol:
    E86-D No:12
      Page(s):
    2503-2507

    Since the characteristic to current information systems is the dynamic and concurrent change of their configurations and scales with non-stop provision of their services, the system management should inevitably rely on autonomic computing. Since fault tolerance is the one of important system management issues, it should also be incorporated in autonomic computing environment. This paper argues what should be taken into consideration and what approach could be available to realize the fault tolerance in such environments.

  • Dependability Evaluation with Fault Injection Experiments

    Piotr GAWKOWSKI  Janusz SOSNOWSKI  

     
    PAPER-Verification and Dependability Analysis

      Vol:
    E86-D No:12
      Page(s):
    2642-2649

    In the paper we evaluate program susceptibility to hardware faults using fault injector. The performed experiments cover many applications with different features. The effectiveness of software techniques improving system dependability is analyzed. Practical aspects of embedding these techniques in real programs are discussed. They have significant impact on the final fault robustness.

  • QoS Certification of Real-Time Distributed Computing Systems: Issues and Promising Approaches

    K.H. (Kane) KIM  

     
    INVITED PAPER

      Vol:
    E86-D No:10
      Page(s):
    2077-2086

    The general public is expected to demand in not too distant future instituting more stringent certification procedures for computing parts of traditional and new-generation safety-critical application systems. Such quality-of-service (QoS) certification processes will not and can not rely solely on the testing approach. Design-time guaranteeing of timely service capabilities of various subsystems is an inevitable part of such processes. Although some promising developments in this area have been occurring in recent years, the technological challenges yet to be overcome are enormous. This paper is a summary of the author's perspective on the remaining challenges and promising directions for tackling them.

  • Verifying Fault Tolerance of Concurrent Systems by Model Checking

    Tomoyuki YOKOGAWA  Tatsuhiro TSUCHIYA  Tohru KIKUNO  

     
    PAPER

      Vol:
    E85-A No:11
      Page(s):
    2414-2425

    Model checking is a technique that can make a verification for finite state systems absolutely automatic. We propose a method for automatic verification of fault-tolerant concurrent systems using this technique. Unlike other related work, which is tailored to specific systems, we are aimed at providing an approach that can be used to verify various kinds of systems against fault tolerance. The main obstacle in model checking is state explosion. To avoid the problem, we design this method so that it can use a symbolic model checking tool called SMV (Symbolic Model Verifier). Symbolic model checking can overcome the problem by expressing the state space and the transition relation by Boolean functions. Assuming that a system to be verified is modeled as a guarded command program, we design a modeling language and propose a translation method from the modeling language to the input language of SMV. We show the results of applying the proposed method to various examples to demonstrate the feasibility of the method.

  • Probabilistic Checkpointing

    Hyochang NAM  Jong KIM  Sung Je HONG  Sunggu LEE  

     
    PAPER-Fault Tolerance

      Vol:
    E85-D No:7
      Page(s):
    1093-1104

    For checkpointing to be practical, it has to introduce low overhead for the targeted application. As a means of reducing the overhead of checkpointing, this paper proposes a probabilistic checkpointing method, which uses block encoding to detect the modified memory area between two consecutive checkpoints. Since the proposed technique uses block encoding to detect the modified area, the possibility of aliasing exists in encoded words. However, this paper shows that the aliasing probability is near zero when an 8-byte encoded word is used. The performance of the proposed technique is analyzed and measured by using experiments. An analytic model which predicts the checkpointing overhead is first constructed. By using this model, the block size that produces the best performance for a given target program is estimated. In most cases, medium block sizes, i.e., 128 or 256 bytes, show the best performance. The proposed technique has also been implemented on Unix based systems, and its performance has been measured in real environments. According to the experimental results, the proposed technique reduces the overhead by 11.7% in the best case and increases the overhead by 0.5% in the worst case in comparison with page-based incremental checkpointing.

  • Analytical Evaluation of Internet Packet Loss Recovery Using Convolutional Codes

    Anna YAMAGUCHI  Masayuki ARAI  Satoshi FUKUMOTO  Kazuhiko IWASAKI  

     
    PAPER-Fault Tolerance

      Vol:
    E85-D No:5
      Page(s):
    854-863

    With increasing Internet traffic congestion, the provision of reliable transmission and packet loss recovery continues to be of substantial importance. In this paper, we analyze a new recovery method using punctured convolutional codes, demonstrating the simplicity and efficiency of the proposed method for the recovery of lost packets. The analysis provides a method for determining the recoverability and the post-reconstruction receiving rate for a given convolutional code. The exact expressions for calculating the recovery rate are derived for a number of convolutional codes and the (2, 1, m) punctured convolutional code. Where packet loss probabilities are in the range typically found in Internet transmissions, the convolutional code-based method delivers superior performance over the traditional parity method with the same redundancy.

  • Fault-Tolerance Design for Multicast Using Convolutional-Code-Based FEC and Its Analytical Evaluation

    Anna YAMAGUCHI  Masayuki ARAI  Hitoshi KUROSU  Satoshi FUKUMOTO  Kazuhiko IWASAKI  

     
    PAPER-Fault Tolerance

      Vol:
    E85-D No:5
      Page(s):
    864-873

    In this paper, we propose and analytically evaluate the use of punctured convolutional codes for recovering packets lost in multicast transmission. An independent erasure channel is assumed for packets transmission over a star topology. The analysis provides a method for determining the recoverability and the post-reconstruction receiving rate for a given convolutional code. We theoretically evaluate the effectiveness of the proposed approach taking into account two different parameters: the number of transmissions per packet and the number of packets needed to be sent to guarantee the reception of data. Finally, we compare the proposed approach with the scheme when parity packets are generated based on Reed-Solomon codes.

  • FT_HORB : A Fault-Tolerant Distributed Programming Environment Based on RMI

    Shik KIM  Muyong HYUN  Jiro YAMAKITA  

     
    PAPER-Computer Systems

      Vol:
    E85-D No:3
      Page(s):
    510-517

    In distributed systems, the provision for failure-recovery is always a hot design issue, whereas no fault-tolerant feature has been extensively considered in the current RMI, CORBA and other OODP environments. As a result, application developers have to implement their own fault tolerant mechanisms. In this paper, we propose a fault-tolerant development environment based on one kind of RMI, called FT_HORB, as a Java extension for the reliable distributed computing with checkpoints and rollback-recovery mechanism. The FT_HORB is implemented on the Sun Ultra10 workstations connected through a 100 Mbps network. We observe that experimental applications on the FT_HORB can continue their operations in spite of hardware and software failures. Three benchmark models such as the nqueens problem, the traveling salesman problem and the gaussian elimination problem are experimented with the FT_HORB to evaluate its performance. The results show the performance of FT_HORB is acceptable. In addition, experiments demonstrate its possibility of extension to fully support our optimal design goal.

  • Potential of Constructive Timing-Violation

    Toshinori SATO  Itsujiro ARITA  

     
    PAPER-High-Performance Technologies

      Vol:
    E85-C No:2
      Page(s):
    323-330

    This paper proposes constructive timing-violation (CTV) and evaluates its potential. It can be utilized both for increasing clock frequency and for reducing energy consumption. Increasing clock frequency over that determined by the critical paths causes timing violations. On the other hand, while supply voltage reduction can result in substantial power savings, it also causes larger gate delay and thus clock must be slow down in order not to violate timing constraints of critical paths. However, if any tolerant mechanisms are provided for the timing violations, it is not necessary to keep the constraints. Rather, the violations would be constructive for high clock frequency or for energy savings. From these observations, we propose the CTV, which is supported by the tolerant mechanism based on contemporary speculative execution mechanisms. We evaluate the CTV using a cycle-by-cycle simulator and present its considerably promising potential.

  • Reliable Data Routing for Spatial-Temporal TMR Multiprocessor Systems

    Mineo KANEKO  

     
    PAPER-Fault Tolerance

      Vol:
    E84-D No:12
      Page(s):
    1790-1800

    This paper treats the data routing problem for fault-tolerant systolic arrays based on Triple Modular Redundancy (TMR) in mixed spatial-temporal domain. The number of logical links required in TMR systolic array is basically 9 times larger than the one for corresponding non-fault-tolerant systolic array. The link sharing is a promising method for reducing the number of physical links, which may, however, degrade the fault tolerance of TMR system. This paper proposes several robust data-routing and resource-sharing (plural data transfers share a physical link, or a data transfer and a computational task share a PE as a relay node for the former and as a processor for the latter), by which certain classes of fault tolerant property will be guaranteed. A stage and a dominated set are introduced to characterize the features of routing/resource-sharing in TMR systems, and conditions on the dominated set and their resultant fault-tolerant properties are derived.

  • A System for Efficiently Self-Reconstructing 1(1/2)-Track Switch Torus Arrays

    Tadayoshi HORITA  Itsuo TAKANAMI  

     
    PAPER-Fault Tolerance

      Vol:
    E84-D No:12
      Page(s):
    1801-1809

    A mesh-connected processor array consists of many similar processing elements (PEs), which can be executed in both parallel and pipeline processing. For the implementation of an array of large numbers of processors, it is necessary to consider some fault tolerant issues to enhance the (fabrication-time) yield and the (run-time) reliability. In this paper, we introduce the 1(1/2)-track switch torus array by changing the connections in 1(1/2)-track switch mesh array, and we apply our approximate reconfiguration algorithm to the torus array. We describe the reconfiguration strategy for the 1(1/2)-track switch torus array and its realization using WSI, especially 3-dimensional realization. A hardware realization of the algorithm is proposed and simulation results about the array reliability are shown. These imply that a self-reconfigurable system with no host computer can be realized using our method, hence our method is effective in enhancing the run-time reliability as well as the fabrication-time yield of processor arrays.

  • Design of Fault Tolerant Multistage Interconnection Networks with Dilated Links

    Naotake KAMIURA  Takashi KODERA  Nobuyuki MATSUI  

     
    PAPER

      Vol:
    E84-D No:11
      Page(s):
    1500-1507

    In this paper we propose a MIN (Multistage Interconnection Network) whose performance in the faulty case degrades as gracefully as possible. We focus on a two-dilated baseline network as a sort of MIN. The link connection pattern in our MIN is determined so that all the available paths established between an input terminal and an output terminal via an identical input of a SE (Switching Element) in some stage will never pass through an identical SE in the next stage. Extra links are useful in improving the performance of the MIN and do not complicate the routing scheme. There is no difference between our MIN and others constructed from a baseline network with regard to numbers of links and cross points in all SEs. The theoretical computation and simulation-based study show that our MIN is superior to others in performance, especially in robustness against concentrated SE faults in an identical stage.

  • Fault-Tolerant Ring- and Toroidal Mesh-Connected Processor Arrays Able to Enhance Emulation of Hypercubes

    Nobuo TSUDA  

     
    PAPER

      Vol:
    E84-D No:11
      Page(s):
    1452-1461

    An advanced spare-connection scheme for K-out-of-N redundancy is proposed for constructing fault-tolerant ring- or toroidal mesh-connected processing-node arrays able to enhance emulation of binary hypercubes by using bypass networks. With this scheme, a component redundancy configuration for a base array with a fixed number of primary nodes, such as that for 8-node ring or 32-node toroidal mesh, can be constructed by using bypass links with a segmented bus structure to selectively connect the primary nodes to a spare node in parallel. These bypass links are allocated to the primary nodes by graph-node coloring with a minimum inter-node distance of three in order to use the bypass links as the hypercube connections as well as to attain strong fault tolerance for reconfiguring the base array with the primary network topology. An extended redundancy configuration for a large fault-tolerant array can be constructed by connecting the component configurations by using external switches of a hub type provided at the bus nodes of the bypass links. This configuration has a network topology of the parallel star-connections of sub-hypercubes whose diameter is smaller than that of the regular hypercube.

  • The Evolutionary Algorithm-Based Reasoning System

    Moritoshi YASUNAGA  Ikuo YOSHIHARA  Jung Hwan KIM  

     
    PAPER

      Vol:
    E84-D No:11
      Page(s):
    1508-1520

    In this paper, we propose the evolutionary algorithm-based reasoning system and its design methodology. In the proposed design methodology, reasoning rules behind the past cases in each task (in each case database) are extracted through genetic algorithms and are expressed as truth tables (we call them 'evolved truth tables'). Circuits for the reasoning systems are synthesized from the evolved truth tables. Parallelism in each task can be embedded directly in the circuits by the hardware implementation of the evolved truth tables, so that the high speed reasoning system with small or acceptable hardware size is achieved. We developed a prototype system using Xilinx Virtex FPGA chips and applied it to the gene boundary reasoning (GBR) and English pronunciation reasoning (EPR), which are very important practical tasks in the genome science and language processing field, respectively. The GBR and the EPR prototype systems are evaluated in terms of the reasoning accuracy, circuit size, and processing speed, and compared with the conventional approaches in the parallel AI and the artificial neural networks. Fault injection experiments are also carried out using the prototype system, and its high fault-tolerance, or graceful degradation against defective circuits that suits to the hardware implementation using wafer scale LSIs is demonstrated.

  • A Graph-Theoretic Approach to Minimizing the Number of Dangerous Processors in Fault-Tolerant Mesh-Connected Processor Arrays

    Itsuo TAKANAMI  

     
    PAPER

      Vol:
    E84-D No:11
      Page(s):
    1462-1470

    First, we give a graph-theoretic formalization for the spare assignment problems for two cases of reconfiguring NN mesh-connected processor arrays with spares on a diagonal line in the array or two orthogonal lines at the edges of the array. Second, we discuss the problems for minimizing the numbers of "dangerous processors" for the cases. Here, a dangerous processor is a nonfaulty one for which there remains no spare processor to be assigned if it becomes faulty, without modifying the spare assignments to other faulty processors. The problem for the latter case, originally presented by Melhem, has already been discussed and solved by the O(N2) algorithm in [3], but it's procedure is very complicated. Using the above graph-theoretic formalization, we give efficient plain algorithms for minimizing the numbers of dangerous processors by which the problems for both the cases can be solved in O(N) time.

  • A High Assurance On-Line Recovery Technology for a Space On-Board Computer

    Hiroyuki YASHIRO  Teruo FUJIWARA  Kinji MORI  

     
    PAPER-Issues

      Vol:
    E84-D No:10
      Page(s):
    1350-1359

    A high assurance on-line recovery technology for a space on-board computer that can be realized using commercial devices is proposed whereby a faulty processor node confirms its normality and then recovers without affecting the other processor nodes in operation. Also, the result of an evaluation test using the breadboard model implementing this technology is reported. Because this technology enables simple and assured recovery of a faulty processor node regardless of its degree of redundancy, it can be applied to various applications, such as a launch vehicle, a satellite, and a reusable launch vehicle. As a result, decreasing the cost of an on-board computer is possible while maintaining its high reliability.

41-60hit(100hit)